Lesson 4


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
ggplot(aes(x = age, y = friend_count), data = pf) +
  geom_point(alpha = 1/20) +
  xlim(13, 90)
## Warning: Removed 4906 rows containing missing values (geom_point).


ggplot(aes(x = age, y = friendships_initiated), data = pf) +
  geom_point(alpha = 1/10) +
  xlim(13, 90) +
  coord_trans(y = 'sqrt')
## Warning: Removed 4906 rows containing missing values (geom_point).

What are some things that you notice right away?

Response:


ggplot Syntax

Notes:


Overplotting

Notes:

What do you notice in the plot?

Response:


Coord_trans()

Notes:

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

What do you notice?
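
One detail worth keeping in mind when reading a plot that uses coord_trans(): per the ggplot2 documentation, coordinate transformations are applied after any statistical summaries are computed, whereas scale transformations such as scale_y_sqrt() are applied before. The toy numbers below are made up for illustration (they are not taken from pseudo_facebook.tsv) and sketch why that order matters for a square root.

```r
# Hypothetical friend counts for one age group (illustrative values only).
friend_count <- c(0, 1, 4, 9, 100)

# coord_trans(y = 'sqrt'): the summary is computed on the raw counts,
# and only then is its position on the axis square-root transformed.
sqrt_of_mean <- sqrt(mean(friend_count))

# scale_y_sqrt(): the counts are transformed first, then summarised.
mean_of_sqrt <- mean(sqrt(friend_count))

print(c(sqrt_of_mean = sqrt_of_mean, mean_of_sqrt = mean_of_sqrt))
```

The two results differ, so a mean line drawn under coord_trans() marks the mean of the raw counts, only on a warped axis.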


Alpha and Jitter

Notes:
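
Because age is recorded in whole years, points pile up in vertical columns, and jitter spreads them out by adding a little noise. A possible pitfall, sketched below with made-up values and fixed offsets standing in for random noise, is that jittering the y value can push a count of 0 below zero, which a square root transformation cannot display. Restricting jitter to the horizontal direction (a jitter height of 0) avoids this.

```r
# Hypothetical data: integer ages, some users with zero friends.
age <- c(20, 20, 20, 21, 21)
friend_count <- c(0, 3, 10, 0, 7)

# Fixed offsets standing in for the random noise that jitter would add.
noise <- c(-0.3, 0.2, -0.1, 0.4, -0.2)

# Horizontal jitter spreads out the stacked integer ages.
jittered_age <- age + noise

# Vertical jitter can push a count of 0 below zero ...
jittered_count <- friend_count + noise

# ... and the square root of a negative number is NaN, so the point is lost.
sqrt_jittered <- suppressWarnings(sqrt(jittered_count))
print(data.frame(jittered_age, jittered_count, sqrt_jittered))
```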


Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Conditional summaries: one row per age with mean, median, and group size.
pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

Create your plot!
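
To make the group-and-summarise step concrete, here is a dependency-free sketch of the same computation in base R. The data frame `toy` and its values are hypothetical. tapply() plays the role of group_by() plus summarise(), and length() stands in for n().

```r
# Hypothetical miniature version of the data (illustrative values only).
toy <- data.frame(age = c(13, 13, 14, 14, 14),
                  friend_count = c(10, 30, 0, 5, 100))

# One summary value per distinct age.
mean_by_age   <- tapply(toy$friend_count, toy$age, mean)
median_by_age <- tapply(toy$friend_count, toy$age, median)
n_by_age      <- tapply(toy$friend_count, toy$age, length)

# Assemble the summaries into one row per age, ordered by age.
toy_by_age <- data.frame(age = as.numeric(names(mean_by_age)),
                         friend_count_mean = as.vector(mean_by_age),
                         friend_count_median = as.vector(median_by_age),
                         n = as.vector(n_by_age))
print(toy_by_age)
```

The mean and median disagree sharply for age 14 because a single large value pulls the mean upward. That gap is one reason to summarise with both.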


Overlaying Summaries with Raw Data

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) +
  xlim(13, 90) +
  # Jitter horizontally only (height = 0) so counts of 0 never go negative under sqrt.
  geom_point(alpha = 1/20, position = position_jitter(height = 0), color = 'orange') +
  coord_trans(y = 'sqrt') +
  # Conditional mean, then the 90th, 50th, and 10th percentiles of friend_count by age.
  # `fun` replaces `fun.y`, which is deprecated as of ggplot2 3.3.0.
  geom_line(stat = 'summary', fun = mean) +
  geom_line(stat = 'summary', fun = quantile, color = 'blue', fun.args = list(probs = 0.9), linetype = 2) +
  geom_line(stat = 'summary', fun = quantile, color = 'blue', fun.args = list(probs = 0.5)) +
  geom_line(stat = 'summary', fun = quantile, color = 'blue', fun.args = list(probs = 0.1), linetype = 2)
## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5199 rows containing missing values (geom_point).

What are some of your observations of the plot?

Response:

?coord_cartesian()


Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes:

?cor.test

Look up the documentation for the cor.test function.
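
As a rough sketch of how the result of cor.test() can be used, the example below runs it on simulated vectors `x` and `y` (stand-ins, not the real columns) and pulls out the pieces that are usually reported. The object it returns has class "htest", and the coefficient is stored in its `estimate` element.

```r
set.seed(42)

# Simulated stand-ins for two numeric columns (not the real data).
x <- 1:50
y <- -2 * x + rnorm(50, sd = 10)

# cor.test() returns a list-like "htest" object.
result <- cor.test(x, y, method = "pearson")

# The coefficient is in $estimate; unname() drops the "cor" label.
r <- unname(result$estimate)
print(round(r, 3))

# The confidence interval and p-value are available too.
print(result$conf.int)
print(result$p.value)
```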

What’s the correlation between age and friend count? Round to three decimal places.

Response:


Correlation on Subsets

Notes:
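
A correlation computed on the whole data set can be diluted by a region where the relationship breaks down. The sketch below uses simulated data in which friend count falls steadily with age up to 70 and then behaves differently. Both the cutoff of 70 and the shape of the data are assumptions chosen for illustration. with() and subset() restrict the calculation to part of the data.

```r
set.seed(7)

# Simulated data: a clear downward trend up to age 70, a flat level afterwards.
toy <- data.frame(age = sample(13:100, 500, replace = TRUE))
toy$friend_count <- ifelse(toy$age <= 70, 500 - 5 * toy$age, 300) +
  rnorm(500, sd = 40)

# Correlation over all rows.
cor_all <- cor(toy$age, toy$friend_count)

# Correlation restricted to ages 70 and under.
cor_subset <- with(subset(toy, age <= 70), cor(age, friend_count))

print(round(c(all = cor_all, subset = cor_subset), 3))
```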


Correlation Methods

Notes:
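
The `method` argument of cor() and cor.test() accepts "pearson", "spearman", and "kendall". As a small illustration on made-up data, a relationship that is strictly increasing but strongly curved gets a perfect rank-based correlation while the Pearson coefficient, which measures linear association, stays below 1.

```r
# A strictly increasing but strongly curved relationship (made-up data).
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks
kendall  <- cor(x, y, method = "kendall")   # based on concordant pairs

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```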


Create Scatterplots

Notes:

colnames(pf)
##  [1] "userid"                "age"                  
##  [3] "dob_day"               "dob_year"             
##  [5] "dob_month"             "gender"               
##  [7] "tenure"                "friend_count"         
##  [9] "friendships_initiated" "likes"                
## [11] "likes_received"        "mobile_likes"         
## [13] "mobile_likes_received" "www_likes"            
## [15] "www_likes_received"
ggplot(aes(x = www_likes_received, y = likes_received),data = pf) +
  geom_point() +
  coord_cartesian(xlim = c(0, 10000), ylim = c(0, 40000))

cor.test(pf$www_likes_received, pf$likes_received)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Strong Correlations

Notes:

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

Response:


Moira on Correlation

Notes:


More Caution with Correlation

Notes:

Create your plot!

library(alr3)
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data(Mitchell)
ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month, Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

?Mitchell


Making Sense of Data

Notes:

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 204, 12))
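
Placing axis breaks every 12 months hints at a yearly cycle. A near-zero correlation coefficient does not rule out a strong relationship of that kind, since the coefficient only captures linear association. The sketch below uses a simulated, noise-free seasonal series (an assumption, not the Mitchell data) and folds the months onto a single year with the modulo operator `%%`.

```r
# Simulated series: 204 months of a purely seasonal signal (assumed shape).
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# The linear correlation with the running month index is close to zero ...
overall_cor <- cor(month, temp)
print(round(overall_cor, 3))

# ... yet temperature is fully determined by the month of the year.
month_of_year <- month %% 12
seasonal_mean <- tapply(temp, month_of_year, mean)
print(round(seasonal_mean, 2))
```

Plotting against `month_of_year` instead of `month` would overlay all years on one cycle, which makes the pattern visible at a glance.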


A New Perspective

What do you notice?

Response:

Watch the solution video and check out the Instructor Notes!

Notes:


Understanding Noise: Age to Age Months

Notes:

pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)
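
A quick worked example of that formula on made-up values: three users who all report age 36 but were born in different months. One reading of the formula, stated here as an assumption, is that ages are measured at the end of a calendar year, so a December birthday adds nothing and a January birthday adds eleven twelfths of a year.

```r
# Hypothetical users: same age in whole years, different birth months.
age <- c(36, 36, 36)
dob_month <- c(1, 6, 12)

# Same formula as above, applied to plain vectors.
age_with_months <- age + (1 - dob_month / 12)
print(age_with_months)
```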

Age with Months Means

The original line plot of mean friend count by age (in whole years, not age with months) was:
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) +
  geom_line()

p1

library(dplyr)
pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age_with_months)

head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1            13.2              46.3                30.5     6
## 2            13.2             115.                 23.5    14
## 3            13.3             136.                 44      25
## 4            13.4             164.                 72      33
## 5            13.5             131.                 66      45
## 6            13.6             157.                 64      54
p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months,age_with_months < 71)) +
  geom_line()

p2

Now, plot the previous two plots together, one above the other (ncol = 1 stacks them in a single column):

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, ncol = 1)

Programming Assignment

data("diamonds")
colnames(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
ggplot(aes(x = x, y = price), data = diamonds) + 
  geom_point()


Noise in Conditional Means


Smoothing Conditional Means

Notes:
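
Conditional means computed from small groups are noisy, and a smoother trades some of that noise for a steadier curve. In ggplot2, geom_smooth() is one common way to draw such a curve. The sketch below instead uses base R's loess() so that it runs without extra packages. The linear trend and the noise level are assumptions chosen for illustration.

```r
set.seed(123)

# Simulated conditional means: a linear trend plus sampling noise.
age <- 13:70
true_mean <- 100 + 2 * (age - 13)
noisy_mean <- true_mean + rnorm(length(age), sd = 30)

# Fit a local regression and read off the smoothed values.
fit <- loess(noisy_mean ~ age)
smoothed_mean <- predict(fit)

# Average distance from the underlying trend, before and after smoothing.
error_raw <- mean(abs(noisy_mean - true_mean))
error_smoothed <- mean(abs(smoothed_mean - true_mean))
print(c(raw = error_raw, smoothed = error_smoothed))
```

The smoothed values sit closer to the underlying trend here, but the same averaging can also flatten genuine spikes in real data, so it is worth comparing the smoothed curve with the raw conditional means.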


Which Plot to Choose?

Notes:


Analyzing Two Variables

Reflection:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!